AITopics | image region

FOCUS: Internal MLLM Representations for Efficient Fine-Grained Visual Question Answering

Neural Information Processing SystemsJun-19-2026, 13:01:02 GMT

While Multimodal Large Language Models (MLLMs) offer strong perception and reasoning capabilities for image-text input, Visual Question Answering (VQA) focusing on small image details still remains a challenge. Although visual cropping techniques seem promising, recent approaches have several limitations: the need for task-specific fine-tuning, low efficiency due to uninformed exhaustive search, or incompatibility with efficient attention implementations. We address these shortcomings by proposing a training-free visual cropping method, dubbed FOCUS, that leverages MLLM-internal representations to guide the search for the most relevant image region. This is accomplished in four steps: first, we identify the target object(s) in the VQA prompt; second, we compute an object relevance map using the key-value (KV) cache; third, we propose and rank relevant image regions based on the map; and finally, we perform the fine-grained VQA task using the topranked region.

llava-1, natural language, question answering, (20 more...)

Neural Information Processing Systems

Country:

North America > United States (1.00)
Europe (1.00)

Genre:

Research Report > Experimental Study (1.00)
Overview (0.92)

Industry: Transportation (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.87)
Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.70)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.68)

Add feedback

09b6e009612875dd0a7291d5f4fd8b49-Paper-Conference.pdf

Neural Information Processing SystemsApr-24-2026, 11:30:20 GMT

artificial intelligence, cross-image feature, machine learning, (14 more...)

Neural Information Processing Systems

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)

Add feedback

Visual Question Answering with Question Representation Update (QRU)

Ruiyu Li, Jiaya Jia

Neural Information Processing SystemsApr-22-2026, 14:34:25 GMT

Neural Information Processing Systems http://nips.cc/

artificial intelligence, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Country: Asia > China (0.14)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(2 more...)

Add feedback

RAPHAEL: Text-to-Image Generation via Large Mixture of Diffusion Paths

Neural Information Processing SystemsFeb-15-2026, 14:00:19 GMT

Text-to-image generation has recently witnessed remarkable achievements.

artificial intelligence, arxiv preprint arxiv, machine learning, (12 more...)

Neural Information Processing Systems

Country:

Asia > China > Hong Kong (0.05)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
(2 more...)

Genre: Research Report > New Finding (0.46)

Industry:

Health & Medicine (0.46)
Information Technology (0.46)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.97)

Add feedback

Adaptively Aligned Image Captioning via Adaptive Attention Time

Lun Huang, Wenmin Wang, Yaxian Xia, Jie Chen

Neural Information Processing SystemsFeb-15-2026, 10:09:11 GMT

AATallowstheframeworktolearn howmany attention steps to take to output a caption word at each decoding step. With AAT, an image region can be mapped to an arbitrary number of caption words while a caption word can also attend to an arbitrary number of image regions. AAT is deterministic and differentiable, and doesn't introduce any noise to the parameter gradients.

artificial intelligence, attention step, machine learning, (15 more...)

Neural Information Processing Systems

Country: